{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据处理和特征工程 — 构造特征  \n",
    "1.构造lfm_reco    \n",
    "2.构造user_pop和item_pop，user_rate和item_rate  \n",
    "3.生成最终训练文件  \n",
    "4.生成最终测试文件  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们认为仅仅利用原始特征是不够的，我们希望从数据中挖掘更多有用信息，因此我们另外构造了5个特征，分别是：  \n",
    "用户活跃度user_pop，歌曲热度item_pop，用户订阅率user_rate，歌曲订阅率item_rate，LFM推荐度lfm_reco，并且写入原始训练和测试文件，得到最终训练和测试文件train_final.csv和test_final.csv。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from lightgbm.sklearn import LGBMClassifier\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "dpath = \"./data/\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.构造lfm_reco"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练LFM模型，得到用户隐向量user_vec和歌曲隐向量item_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lfm_train(train_data, F, alpha, beta, step):\n",
    "    \"\"\"\n",
    "    train LFM model,get latent factor user_vec and item_vec\n",
    "    Args:\n",
    "        train_data: train_data for lfm\n",
    "        F: user vector len, item vector len\n",
    "        alpha:regularization factor\n",
    "        beta: learning rate\n",
    "        step: iteration number\n",
    "    Return:\n",
    "        dict: key itemid, value:np.ndarray\n",
    "        dict: key userid, value:np.ndarray\n",
    "    \"\"\"\n",
    "    user_vec = {}\n",
    "    item_vec = {}\n",
    "    count = 0\n",
    "    for step in range(step):\n",
    "        fin = open(dpath+train_data,\"r+\")\n",
    "        start = 0\n",
    "        for line in fin:\n",
    "            if start == 0:\n",
    "                start += 1\n",
    "                continue\n",
    "            cols = line.strip().split(\",\")\n",
    "            userid,itemid,target = cols[0],cols[1],cols[-1]\n",
    "            if userid not in user_vec:\n",
    "                user_vec[userid] = np.random.randn(F)\n",
    "            if itemid not in item_vec:\n",
    "                item_vec[itemid] = np.random.randn(F)\n",
    "            #target是str，需转换为int\n",
    "            delta = int(target)-lfm_score(user_vec[userid],item_vec[itemid])\n",
    "            for i in range(F):\n",
    "                user_vec[userid][i] += beta*(delta*item_vec[itemid][i]\\\n",
    "                                            -alpha*user_vec[userid][i])\n",
    "                item_vec[itemid][i] += beta*(delta*user_vec[userid][i]\\\n",
    "                                            -alpha*item_vec[itemid][i])\n",
    "            count += 1\n",
    "            #第1轮不更新学习率\n",
    "            if step == 0:\n",
    "                continue\n",
    "            #每200000个样本更新一次学习率\n",
    "            if count%200000==0:\n",
    "                beta *= 0.95\n",
    "            if count%1000000==0:\n",
    "                print(\"step %d,count %d,learning rate %g:\"%(step, count, beta))\n",
    "    pickle.dump(user_vec,open(dpath+\"user_vec.pkl\",\"wb\"))\n",
    "    pickle.dump(item_vec,open(dpath+\"item_vec.pkl\",\"wb\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据user_vec和item_vec计算基于LFM的推荐度得分lfm_reco"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lfm_score(user_vector,item_vector):\n",
    "    \"\"\"\n",
    "    user_vector and item_vector distance\n",
    "    Args:\n",
    "        user_vector: lfm model produce user vector\n",
    "        item_vector: lfm model produce item vector\n",
    "    Return:\n",
    "         lfm recommend score\n",
    "    \"\"\"\n",
    "    score = np.dot(user_vector, item_vector)/\\\n",
    "                (np.linalg.norm(user_vector)*np.linalg.norm(item_vector))\n",
    "    return score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "step 1,count 8000000,learning rate 0.0814506:\n",
      "step 1,count 9000000,learning rate 0.0630249:\n",
      "step 1,count 10000000,learning rate 0.0487675:\n",
      "step 1,count 11000000,learning rate 0.0377354:\n",
      "step 1,count 12000000,learning rate 0.0291989:\n",
      "step 1,count 13000000,learning rate 0.0225936:\n",
      "step 1,count 14000000,learning rate 0.0174825:\n",
      "step 2,count 15000000,learning rate 0.0135276:\n",
      "step 2,count 16000000,learning rate 0.0104674:\n",
      "step 2,count 17000000,learning rate 0.00809947:\n",
      "step 2,count 18000000,learning rate 0.00626722:\n",
      "step 2,count 19000000,learning rate 0.00484945:\n",
      "step 2,count 20000000,learning rate 0.00375241:\n",
      "step 2,count 21000000,learning rate 0.00290355:\n",
      "step 2,count 22000000,learning rate 0.00224671:\n",
      "step 3,count 23000000,learning rate 0.00173846:\n",
      "step 3,count 24000000,learning rate 0.00134519:\n",
      "step 3,count 25000000,learning rate 0.00104088:\n",
      "step 3,count 26000000,learning rate 0.000805413:\n",
      "step 3,count 27000000,learning rate 0.000623214:\n",
      "step 3,count 28000000,learning rate 0.000482231:\n",
      "step 3,count 29000000,learning rate 0.000373141:\n",
      "step 4,count 30000000,learning rate 0.000288729:\n",
      "step 4,count 31000000,learning rate 0.000223413:\n",
      "step 4,count 32000000,learning rate 0.000172873:\n",
      "step 4,count 33000000,learning rate 0.000133766:\n",
      "step 4,count 34000000,learning rate 0.000103505:\n",
      "step 4,count 35000000,learning rate 8.00905e-05:\n",
      "step 4,count 36000000,learning rate 6.19725e-05:\n",
      "Wall time: 1h 12min 8s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "lfm_train(\"train_merge.csv\", 60, 0.01, 0.1, 5)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.构造user_pop和item_pop，user_rate和item_rate  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据用户历史行为数据得到user_record和item_record  \n",
    "user_record = {userid: [value1,value2],...}  \n",
    "item_record = {itemid: [value1,value2],...}  \n",
    "用户活跃度user_pop=value2，用户订阅率user_rate=value2/value1  \n",
    "歌曲热度item_pop=value2，歌曲订阅率item_rate=value2/value1  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_record_file(input_file):\n",
    "    \"\"\"\n",
    "    get user_record dict and item_record dict\n",
    "    user_record = {userid: [value1,value2],...}\n",
    "    item_record = {itemid: [value1,value2],...}\n",
    "    args:\n",
    "        input_file: user item record file - train_merge.csv\n",
    "    \"\"\"\n",
    "    fin = open(dpath+input_file,\"r+\")\n",
    "    user_record = {}\n",
    "    item_record = {}\n",
    "    start = 0\n",
    "    for line in fin:\n",
    "        if start == 0:\n",
    "            start += 1\n",
    "            continue\n",
    "        cols = line.strip().split(\",\")\n",
    "        userid,itemid,target = cols[0],cols[1],cols[-1]\n",
    "        if userid not in user_record:\n",
    "            user_record[userid] = [0,0]\n",
    "        user_record[userid][0] += 1\n",
    "        #TypeError: unsupported operand type(s) for +=: 'int' and 'str'  \n",
    "        #int(target)\n",
    "        user_record[userid][1] += int(target)\n",
    "        if itemid not in item_record:\n",
    "            item_record[itemid] = [0,0]\n",
    "        item_record[itemid][0] += 1\n",
    "        item_record[itemid][1] += int(target)\n",
    "    pickle.dump(user_record,open(dpath+\"user_record.pkl\",\"wb\"))\n",
    "    pickle.dump(item_record,open(dpath+\"item_record.pkl\",\"wb\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 21.8 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "get_record_file(\"train_merge.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_mean_std(user_record,item_record):\n",
    "    \"\"\"\n",
    "    get user_pop mean and std, get item_pop mean and std\n",
    "    args:\n",
    "        user_record: user_record dict\n",
    "        item_record: item_record dict\n",
    "    return:\n",
    "        user_pop_mean,user_pop_std,item_pop_mean,item_pop_std\n",
    "    \"\"\"\n",
    "    user_pop_mean = np.mean(list(map(lambda x:x[1],user_record.values())))\n",
    "    user_pop_std = np.std(list(map(lambda x:x[1],user_record.values())))\n",
    "    \n",
    "    item_pop_mean = np.mean(list(map(lambda x:x[1],item_record.values())))\n",
    "    item_pop_std = np.std(list(map(lambda x:x[1],item_record.values())))\n",
    "    \n",
    "    return user_pop_mean,user_pop_std,item_pop_mean,item_pop_std"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### 3. 生成最终训练文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把user_pop,item_pop,user_rate,item_rate,lfm_reco写入文件保存，得到train_final.csv "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "user_rate和item_rate取值为何一样  \n",
    "outcols item_rate写成user_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_train_final(input_file,output_file):\n",
    "    \"\"\"\n",
    "    generate final train file\n",
    "    args:\n",
    "        input_file: input file path\n",
    "        output_file: output file path\n",
    "    \"\"\"\n",
    "    fin = open(dpath+input_file,\"r+\")\n",
    "    fout = open(dpath+output_file,\"w+\")\n",
    "    user_vec = pickle.load(open(dpath+\"user_vec.pkl\",\"rb\"))\n",
    "    item_vec = pickle.load(open(dpath+\"item_vec.pkl\",\"rb\"))\n",
    "    user_record = pickle.load(open(dpath+\"user_record.pkl\",\"rb\"))\n",
    "    item_record = pickle.load(open(dpath+\"item_record.pkl\",\"rb\"))\n",
    "    user_pop_mean,user_pop_std,item_pop_mean,item_pop_std\\\n",
    "                    = get_mean_std(user_record,item_record)\n",
    "    start = 0\n",
    "    lfm_reco = 0\n",
    "    outcols = []\n",
    "    for line in fin:\n",
    "        cols = line.strip().split(\",\")\n",
    "        #写入column name\n",
    "        if start == 0:\n",
    "            outcols = cols[:-1]+[\"user_pop\",\"item_pop\",\"user_rate\",\"item_rate\",\"lfm_reco\"]+[cols[-1]]\n",
    "            fout.write(\",\".join(outcols)+\"\\n\")\n",
    "            start += 1\n",
    "            continue\n",
    "        userid,itemid = cols[0],cols[1]\n",
    "        #计算user_pop，item_pop\n",
    "        user_pop = round((user_record[userid][1]-user_pop_mean)/user_pop_std,3)\n",
    "        item_pop = round((item_record[itemid][1]-item_pop_mean)/item_pop_std,3)\n",
    "        #计算user_rate，item_rate\n",
    "        user_rate = round(user_record[userid][1]/user_record[userid][0],3)\n",
    "        item_rate = round(item_record[itemid][1]/item_record[itemid][0],3)\n",
    "        #计算lfm_reco\n",
    "        if cols[0] in user_vec and cols[1] in item_vec:\n",
    "            lfm_reco = lfm_score(user_vec[cols[0]],item_vec[cols[1]])\n",
    "            lfm_reco = np.around(lfm_reco,decimals=5)\n",
    "            outcols = cols[:-1]+[str(user_pop)]+[str(item_pop)]+\\\n",
    "                      [str(user_rate)]+[str(item_rate)]+[str(lfm_reco)]+[cols[-1]]\n",
    "        else:\n",
    "            continue\n",
    "        #写入文件\n",
    "        fout.write(\",\".join(outcols)+\"\\n\")\n",
    "    fin.close()\n",
    "    fout.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 4min 48s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "generate_train_final(\"train_merge.csv\",\"train_final.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 生成最终测试文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把user_pop,item_pop,user_rate,item_rate,lfm_reco写入文件保存，得到test_final.csv "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_test_final(input_file,output_file):\n",
    "    \"\"\"\n",
    "    generate final test file\n",
    "    args:\n",
    "        input_file: input file path\n",
    "        output_file: output file path\n",
    "    \"\"\"\n",
    "    fin = open(dpath+input_file,\"r+\")\n",
    "    fout = open(dpath+output_file,\"w+\")\n",
    "    user_vec = pickle.load(open(dpath+\"user_vec.pkl\",\"rb\"))\n",
    "    item_vec = pickle.load(open(dpath+\"item_vec.pkl\",\"rb\"))\n",
    "    user_record = pickle.load(open(dpath+\"user_record.pkl\",\"rb\"))\n",
    "    item_record = pickle.load(open(dpath+\"item_record.pkl\",\"rb\"))\n",
    "    user_pop_mean,user_pop_std,item_pop_mean,item_pop_std\\\n",
    "                    = get_mean_std(user_record,item_record)\n",
    "    start = 0\n",
    "    lfm_reco = 0\n",
    "    outcols = []\n",
    "    for line in fin:\n",
    "        cols = line.strip().split(\",\")\n",
    "        #写入column name\n",
    "        if start == 0:\n",
    "            outcols = cols+[\"user_pop\",\"item_pop\",\"user_rate\",\"item_rate\",\"lfm_reco\"]\n",
    "            fout.write(\",\".join(outcols)+\"\\n\")\n",
    "            start += 1\n",
    "            continue\n",
    "        userid,itemid = cols[1],cols[2]\n",
    "        #计算user_pop，user_rate\n",
    "        if userid in user_record:\n",
    "            user_pop = round((user_record[userid][1]-user_pop_mean)/user_pop_std,3)\n",
    "            user_rate = round(user_record[userid][1]/user_record[userid][0],3)\n",
    "        else:\n",
    "            user_pop = 0.\n",
    "            user_rate = 0.\n",
    "        #计算item_pop，item_rate\n",
    "        if itemid in item_record:\n",
    "            item_pop = round((item_record[itemid][1]-item_pop_mean)/item_pop_std,3)\n",
    "            item_rate = round(item_record[itemid][1]/item_record[itemid][0],3)\n",
    "        else:\n",
    "            item_pop = 0.\n",
    "            item_rate = 0.\n",
    "        #计算lfm_reco\n",
    "        if userid in user_vec and itemid in item_vec:\n",
    "            lfm_reco = lfm_score(user_vec[cols[1]],item_vec[cols[2]])\n",
    "            lfm_reco = np.around(lfm_reco,decimals=5)\n",
    "        #写入文件\n",
    "        outcols = cols+[str(user_pop)]+[str(item_pop)]+\\\n",
    "                  [str(user_rate)]+[str(item_rate)]+[str(lfm_reco)]\n",
    "        fout.write(\",\".join(outcols)+\"\\n\")\n",
    "    fin.close()\n",
    "    fout.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 1min 27s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "generate_test_final(\"test_merge.csv\",\"test_final.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_final = pd.read_csv(dpath+\"test_final.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>msno</th>\n",
       "      <th>song_id</th>\n",
       "      <th>source_system_tab</th>\n",
       "      <th>source_screen_name</th>\n",
       "      <th>source_type</th>\n",
       "      <th>city</th>\n",
       "      <th>bd</th>\n",
       "      <th>gender</th>\n",
       "      <th>registered_via</th>\n",
       "      <th>...</th>\n",
       "      <th>expiration_date</th>\n",
       "      <th>song_length</th>\n",
       "      <th>genre_ids</th>\n",
       "      <th>language</th>\n",
       "      <th>mult_genre</th>\n",
       "      <th>user_pop</th>\n",
       "      <th>item_pop</th>\n",
       "      <th>user_rate</th>\n",
       "      <th>item_rate</th>\n",
       "      <th>lfm_reco</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=</td>\n",
       "      <td>WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>...</td>\n",
       "      <td>13</td>\n",
       "      <td>-0.14208</td>\n",
       "      <td>24</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.401</td>\n",
       "      <td>2.944</td>\n",
       "      <td>0.366</td>\n",
       "      <td>0.504</td>\n",
       "      <td>0.38083</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=</td>\n",
       "      <td>y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM=</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>...</td>\n",
       "      <td>13</td>\n",
       "      <td>0.45662</td>\n",
       "      <td>25</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.401</td>\n",
       "      <td>32.878</td>\n",
       "      <td>0.366</td>\n",
       "      <td>0.625</td>\n",
       "      <td>0.53185</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>/uQAlrAkaczV+nWCd2sPF2ekvXPRipV7q0l+gbLuxjw=</td>\n",
       "      <td>8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4=</td>\n",
       "      <td>0</td>\n",
       "      <td>16</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>12</td>\n",
       "      <td>0.42822</td>\n",
       "      <td>12</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.594</td>\n",
       "      <td>-0.072</td>\n",
       "      <td>0.144</td>\n",
       "      <td>0.400</td>\n",
       "      <td>-0.88885</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=</td>\n",
       "      <td>ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ=</td>\n",
       "      <td>7</td>\n",
       "      <td>11</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>9</td>\n",
       "      <td>...</td>\n",
       "      <td>13</td>\n",
       "      <td>0.23750</td>\n",
       "      <td>25</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.034</td>\n",
       "      <td>-0.029</td>\n",
       "      <td>0.296</td>\n",
       "      <td>0.226</td>\n",
       "      <td>-0.18739</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=</td>\n",
       "      <td>MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y=</td>\n",
       "      <td>7</td>\n",
       "      <td>11</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>9</td>\n",
       "      <td>...</td>\n",
       "      <td>13</td>\n",
       "      <td>-0.30702</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.034</td>\n",
       "      <td>-0.072</td>\n",
       "      <td>0.296</td>\n",
       "      <td>0.400</td>\n",
       "      <td>0.98938</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id                                          msno  \\\n",
       "0   0  V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=   \n",
       "1   1  V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=   \n",
       "2   2  /uQAlrAkaczV+nWCd2sPF2ekvXPRipV7q0l+gbLuxjw=   \n",
       "3   3  1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=   \n",
       "4   4  1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=   \n",
       "\n",
       "                                        song_id  source_system_tab  \\\n",
       "0  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=                  3   \n",
       "1  y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM=                  3   \n",
       "2  8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4=                  0   \n",
       "3  ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ=                  7   \n",
       "4  MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y=                  7   \n",
       "\n",
       "   source_screen_name  source_type  city  bd  gender  registered_via  \\\n",
       "0                   7            2     1   0       2               7   \n",
       "1                   7            2     1   0       2               7   \n",
       "2                  16            9     1   0       2               4   \n",
       "3                  11            7     3   5       1               9   \n",
       "4                  11            7     3   5       1               9   \n",
       "\n",
       "     ...     expiration_date  song_length  genre_ids  language  mult_genre  \\\n",
       "0    ...                  13     -0.14208         24         4           0   \n",
       "1    ...                  13      0.45662         25         4           0   \n",
       "2    ...                  12      0.42822         12         2           0   \n",
       "3    ...                  13      0.23750         25         8           0   \n",
       "4    ...                  13     -0.30702         32         0           0   \n",
       "\n",
       "   user_pop  item_pop  user_rate  item_rate  lfm_reco  \n",
       "0    -0.401     2.944      0.366      0.504   0.38083  \n",
       "1    -0.401    32.878      0.366      0.625   0.53185  \n",
       "2    -0.594    -0.072      0.144      0.400  -0.88885  \n",
       "3    -0.034    -0.029      0.296      0.226  -0.18739  \n",
       "4    -0.034    -0.072      0.296      0.400   0.98938  \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_final.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2556790, 21)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_final.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 遇到的坑"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "write() argument must be str, not bytes  \n",
    "解决：用\"wb\"格式打开"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算delta以及user_record时出现错误  \n",
    "TypeError: unsupported operand type(s) for +=: 'int' and 'str'  \n",
    "解决：target是str，需转换为int"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "user_rate和item_rate取值为何一样  \n",
    "outcols item_rate误写成user_rate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "train_final前5行为何为空  \n",
    "因为outcols没有满足判断条件，直接写入[]，解决：修改判断语句"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "'gbk' codec can't decode byte 0x80 in position 0: illegal multibyte sequence  \n",
    "pickle.load改用\"rb\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'  \n",
    "不能用\"lfm_reco\"，改为[\"lfm_reco\"]，这里不能用append"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TypeError: can only concatenate list (not \"str\") to list  \n",
    "cols[-1]改为[cols[-1]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TypeError: 'int' object is not callable  \n",
    "lfm_reco变量和函数重名，把计算推荐度的函数名改为lfm_score"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
